109 research outputs found
A clustering method for robust and reliable large scale functional and structural protein sequence annotation
Over the last few decades, bioinformatics has played a fundamental role in making sense of the huge amount of data produced. Once the complete sequence of a genome has been obtained, the major problem is to learn as much as possible about its coding regions. Protein sequence annotation is challenging and, due to the size of the problem, only computational approaches can provide a feasible solution. As recently pointed out by the Critical Assessment of Function Annotations (CAFA), the most accurate methods are those based on the transfer-by-homology approach, and the most incisive contribution is given by cross-genome comparisons. This thesis describes a non-hierarchical sequence clustering method for automatic large-scale protein annotation, called "The Bologna Annotation Resource Plus" (BAR+). The method is based on an all-against-all alignment of more than 13 million protein sequences, characterized by a very stringent metric. BAR+ can safely transfer functional features (Gene Ontology and Pfam terms) within clusters by means of a statistical validation, even in the case of multi-domain proteins. Within BAR+ clusters it is also possible to transfer the three-dimensional structure (when a template is available). This is made possible by cluster-specific HMM profiles that can be used to calculate reliable template-to-target alignments even in the case of distantly related proteins (sequence identity < 30%).
Other BAR+ based applications were developed during my doctorate, including the prediction of magnesium binding sites in human proteins, the classification of the ABC transporter superfamily and the functional prediction (GO terms) of the CAFA targets. Remarkably, in the CAFA assessment, BAR+ placed among the ten most accurate methods. BAR+ is currently freely available as a web server for functional and structural protein sequence annotation at http://bar.biocomp.unibo.it/bar2.0
The RING 2.0 web server for high quality residue interaction networks
Piovesan, Damiano; Minervini, Giovanni; Tosatto, Silvio C.E.
MobiDB-lite 3.0: fast consensus annotation of intrinsic disorder flavors in proteins
Abstract
Motivation
The earlier version of MobiDB-lite is currently used in large-scale proteome annotation platforms to detect intrinsic disorder. However, new theoretical models allow for the classification of intrinsically disordered regions into subtypes from sequence features associated with specific polymeric properties or compositional bias.
Results
MobiDB-lite 3.0 maintains its previous speed and performance but also provides a finer classification of disorder by identifying regions with characteristics of polyampholytes, positive or negative polyelectrolytes, low-complexity regions, or regions enriched in cysteine, proline, glycine or polar residues. Subregions are abundantly detected in IDRs of the human proteome. The new version of MobiDB-lite represents a new step for the proteome-level analysis of protein disorder.
Availability and implementation
Both the MobiDB-lite 3.0 source code and a docker container are available from the GitHub repository: https://github.com/BioComputingUP/MobiDB-lit
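The charge-based subtypes named in the Results (polyampholytes, positive or negative polyelectrolytes) can be illustrated with a minimal sketch that computes the fractions of positively and negatively charged residues in a region. This is not the MobiDB-lite implementation: the function name `charge_fractions`, the choice of K/R as positive and D/E as negative (histidine ignored), and the returned keys are assumptions made for illustration only, and no classification thresholds are implied.

```python
# Hypothetical helper, not part of MobiDB-lite: summarizes the charge
# composition of a protein region given as a one-letter amino acid string.
POSITIVE = set("KR")  # assumption: lysine and arginine count as positive
NEGATIVE = set("DE")  # assumption: aspartate and glutamate count as negative


def charge_fractions(region: str) -> dict:
    """Return fractions of charged residues in a region."""
    if not region:
        raise ValueError("region must not be empty")
    n = len(region)
    f_pos = sum(aa in POSITIVE for aa in region.upper()) / n
    f_neg = sum(aa in NEGATIVE for aa in region.upper()) / n
    return {
        "f_pos": f_pos,            # fraction of positive residues
        "f_neg": f_neg,            # fraction of negative residues
        "fcr": f_pos + f_neg,      # fraction of charged residues
        "ncpr": f_pos - f_neg,     # signed net charge per residue
    }


if __name__ == "__main__":
    print(charge_fractions("KKDDAAAA"))  # balanced charges
    print(charge_fractions("KKKKAAAA"))  # net positive region
```

Intuitively, a region with many charges of both signs and a net charge near zero resembles a polyampholyte, while a region dominated by one sign resembles a polyelectrolyte; the actual cut-offs used by the tool are not stated here.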
DisProt 7.0 : a major update of the database of disordered proteins
Erratum: Nucleic Acids Res (2017) 45 (D1): D1123-D1124. The publishers would like to apologise for a mistake in the affiliation of one of the authors, Salvador Ventura. The correct affiliation is Departament de Bioquimica i Biologia Molecular and Institut de Biotecnologia i Biomedicina, Universitat Autònoma de Barcelona, Bellaterra 08193, Spain. This has now been corrected online
The human "magnesome": detecting magnesium binding sites on human proteins
BACKGROUND: Magnesium research is increasing in molecular medicine due to the relevance of this ion in several important biological processes and associated molecular pathogeneses. It is still difficult to predict from the protein covalent structure whether or not a human chain is involved in magnesium binding. This is mainly due to the limited information on the structural characteristics of magnesium binding sites in proteins and protein complexes. Magnesium binding features, unlike those of other divalent cations such as calcium and zinc, are elusive. Here we address a question that is relevant in protein annotation: how many human proteins can bind Mg(2+)? Our analysis takes advantage of the recently implemented Bologna Annotation Resource (BAR-PLUS), a non-hierarchical clustering method that relies on the pairwise sequence comparison of about 14 million proteins from over 300,000 species and their grouping into clusters where annotation can safely be inherited after statistical validation. RESULTS: After cluster assignment of the latest version of the human proteome, the total number of human proteins to which we can assign putative Mg binding sites is 3,751. Among these proteins, 2,688 inherit annotation directly from human templates and 1,063 inherit annotation from templates of other organisms. Protein structures are highly conserved inside a given cluster. Transfer of structural properties is possible after alignment of a given sequence with the protein structures that characterise a given cluster, as obtained with a Hidden Markov Model (HMM) based procedure. Interestingly, a set of 370 human sequences inherit Mg(2+) binding sites from templates sharing less than 30% sequence identity with the target sequence. CONCLUSION: We describe and deliver the "human magnesome", a set of proteins of the human proteome that inherit putative binding of magnesium ions.
With our BAR-hMG, 251 clusters including 1,341 magnesium binding protein structures corresponding to 387 sequences are sufficient to annotate some 13,689 residues in 3,751 human sequences as "magnesium binding". Protein structures therefore act as three-dimensional seeds for the structural and functional annotation of human sequences. The database specifically collects all the human proteins that can be annotated according to our procedure as "magnesium binding", the corresponding structures and the BAR+ clusters from which they derive the annotation (http://bar.biocomp.unibo.it/mg)
MobiDB 3.0: more annotations for intrinsic disorder, conformational diversity and interactions in proteins
The MobiDB (URL: mobidb.bio.unipd.it) database of protein disorder and mobility annotations has been significantly updated and upgraded since its last major renewal in 2014. Several curated datasets for intrinsic disorder and folding upon binding have been integrated from specialized databases. The indirect evidence has also been expanded to better capture information available in the PDB, such as high temperature residues in X-ray structures and overall conformational diversity. Novel nuclear magnetic resonance chemical shift data provides an additional experimental information layer on conformational dynamics. Predictions have been expanded to provide new types of annotation on backbone rigidity, secondary structure preference and disordered binding regions. MobiDB 3.0 contains information for the complete UniProt protein set and synchronization has been improved by covering all UniParc sequences. An advanced search function allows the creation of a wide array of custom-made datasets for download and further analysis. A large amount of information and cross-links to more specialized databases are intended to make MobiDB the central resource for the scientific community working on protein intrinsic disorder and mobility
Best practices for the manual curation of Intrinsically Disordered Proteins in DisProt
The DisProt database is a significant resource containing manually curated
data on experimentally validated intrinsically disordered proteins (IDPs) and
regions (IDRs) from the literature. Developed in 2005, its primary goal was to
collect structural and functional information on proteins that lack a fixed
three-dimensional (3D) structure. Today, DisProt has evolved into a major
repository that not only collects experimental data but also contributes
significantly to our understanding of the IDPs/IDRs roles in various biological
processes, such as autophagy or the life cycle mechanisms in viruses, or their
involvement in diseases (such as cancer and neurodevelopmental disorders).
DisProt offers detailed information on the structural states of IDPs/IDRs,
including state transitions, interactions, and their functions, all provided as
curated annotations. One of the central activities of DisProt is the meticulous
curation of experimental data from the literature. For this reason, to ensure
that every expert and volunteer curator possesses the requisite knowledge for
data evaluation, collection, and integration, training courses and curation
materials are available. However, biocuration guidelines concur on the
importance of developing robust guidelines that not only provide critical
information about data consistency but also ensure data acquisition. This
guideline aims to provide both biocurators and external users with best
practices for manually curating IDPs and IDRs in DisProt. It describes every
step of the literature curation process and provides use cases of IDP curation
within DisProt.
Database URL: https://disprot.org
Critical assessment of protein intrinsic disorder prediction
Intrinsically disordered proteins, defying the traditional protein structure-function paradigm, are a challenge to study experimentally. Because a large part of our knowledge rests on computational predictions, it is crucial that their accuracy is high. The Critical Assessment of protein Intrinsic Disorder prediction (CAID) experiment was established as a community-based blind test to determine the state of the art in prediction of intrinsically disordered regions and the subset of residues involved in binding. A total of 43 methods were evaluated on a dataset of 646 proteins from DisProt. The best methods use deep learning techniques and notably outperform physicochemical methods. The top disorder predictor has Fmax = 0.483 on the full dataset and Fmax = 0.792 following filtering out of bona fide structured regions. Disordered binding regions remain hard to predict, with Fmax = 0.231. Interestingly, computing times among methods can vary by up to four orders of magnitude
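For readers unfamiliar with the Fmax figures quoted above, the following is a minimal sketch of how such a score is commonly computed: the F1 score (harmonic mean of precision and recall) is evaluated at every decision threshold on the per-residue prediction scores, and the maximum is reported. The function name `f_max`, the flat lists of scores and labels, and the choice of thresholds are assumptions for illustration; how CAID pools residues across proteins is not described in this abstract.

```python
# Illustrative sketch only: Fmax as the best F1 over all score thresholds.
# `scores` are hypothetical per-residue disorder scores, `labels` are 1 for
# disordered residues and 0 for structured ones.
def f_max(scores, labels):
    best = 0.0
    for t in sorted(set(scores)):
        # A residue is predicted disordered when its score is >= t.
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        if tp == 0:
            continue  # F1 is zero (or undefined) without true positives
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


if __name__ == "__main__":
    print(f_max([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

Because the threshold is chosen after the fact, Fmax describes the best achievable trade-off for a predictor, not its performance at a fixed default cut-off.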